[AutoRound] Support GLM-Image W4A16 quantization model by lvliang-intel · Pull Request #3059 · vllm-project/vllm-omni

lvliang-intel · 2026-04-23T06:32:54Z


Purpose

Support GLM-Image W4A16 AutoRound quantization in vLLM-Omni by extending the existing AutoRound W4A16 infrastructure (originally built for FLUX and Qwen3-Omni) to the GLM-Image diffusion model. This shrinks the checkpoint from ~34 GB to ~13 GB (~62%) and lowers peak GPU memory while preserving generation quality.

https://huggingface.co/Intel/GLM-Image-int4-AutoRound

Related: #1325, #1777, #2670

Key changes:
- Replace all nn.Linear / ColumnParallelLinear / RowParallelLinear projection layers in the GLM-Image DiT with their vLLM quantization-aware counterparts (ReplicatedLinear, ColumnParallelLinear, and RowParallelLinear, each constructed with quant_config).
- Add contiguous calls before RowParallelLinear (required for the FP8/W4A16 kernels) and tuple-unpack the ReplicatedLinear output.
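The tuple-unpacking change can be illustrated without vLLM. This is a minimal sketch, assuming the quantization-aware layer returns an `(output, bias)` pair where the layer it replaces returned the output alone. `PlainLinear`, `TupleLinear`, and `project` are hypothetical stand-ins written for this example, not vLLM classes, and plain Python lists stand in for tensors.

```python
class PlainLinear:
    """Hypothetical stand-in for a layer that returns only its output."""

    def __init__(self, weight):
        # weight is a list of rows with shape [out_features][in_features]
        self.weight = weight

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


class TupleLinear(PlainLinear):
    """Hypothetical stand-in for a layer that returns (output, bias_or_None)."""

    def __call__(self, x):
        return super().__call__(x), None


def project(layer, x):
    """Call a projection layer and return only the output vector."""
    result = layer(x)
    if isinstance(result, tuple):
        output, _bias = result  # tuple-unpacking: keep the output, drop the bias slot
        return output
    return result


weight = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
plain_out = project(PlainLinear(weight), x)
tuple_out = project(TupleLinear(weight), x)
print(plain_out, tuple_out)
```

Call sites that forget the unpacking step would pass a tuple downstream instead of a tensor, which is why the swap touches every use of the replaced layers.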

Test Plan

E2E offline inference tests added.
TIIF-Bench accuracy evaluation test.
DPG-Bench accuracy evaluation test.

Test Result

TIIF-Bench Accuracy (9 Sub-Attributes Average)

| Model | overall-short | overall-long |
|-------|---------------|--------------|
| glm-image-ar-w4a16 | 0.8175 | 0.8645 |
| glm-image (BF16 baseline) | 0.8277 | 0.8903 |

Summary:

The W4A16 quantized model retains:
- 98.8% of the BF16 score (short)
- 97.1% of the BF16 score (long)

Average relative accuracy drop: ~2.1% (about 1.2% short, 2.9% long). The degradation is small and within an acceptable range for 4-bit quantization.
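The retention figures can be re-derived from the table. A small sketch, using only the four scores listed above; the dictionary layout and the helper name `retention_pct` are illustrative choices, not part of the benchmark tooling.

```python
# Scores copied from the TIIF-Bench table above.
scores = {
    "overall-short": {"w4a16": 0.8175, "bf16": 0.8277},
    "overall-long": {"w4a16": 0.8645, "bf16": 0.8903},
}


def retention_pct(quantized, baseline):
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


retained = {
    split: round(retention_pct(s["w4a16"], s["bf16"]), 1)
    for split, s in scores.items()
}
# Relative drop is the complement of retention, per split.
dropped = {
    split: round(100.0 - retention_pct(s["w4a16"], s["bf16"]), 1)
    for split, s in scores.items()
}
print(retained)
print(dropped)
```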

Model Size Reduction

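| Component | BF16 Baseline | W4A16 AutoRound | Reduction |
|-----------|---------------|-----------------|-----------|
| Total | ~34 GB | ~13 GB | ~62% |

Overall, the checkpoint is ~2.6× smaller (34 GB / 13 GB).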
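As a quick cross-check, the percentage reduction and the shrink factor can be computed from the table row. This sketch uses only the rounded ~34 GB and ~13 GB totals, so the exact checkpoint sizes may give slightly different values.

```python
bf16_gb = 34.0   # approximate BF16 total from the table
w4a16_gb = 13.0  # approximate W4A16 total from the table

reduction_pct = 100.0 * (bf16_gb - w4a16_gb) / bf16_gb
shrink_factor = bf16_gb / w4a16_gb
print(f"{reduction_pct:.0f}% smaller, {shrink_factor:.1f}x")
```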

E2E Generation Smoke Test

✅ Text-to-Image: Functional
✅ Image-to-Image: Functional
✅ Output Quality: Valid, non-blank images
✅ Resolution: 256 × 256

The quantized W4A16 model maintains full pipeline functionality with no critical degradation in generation behavior.

Performance Test Result on A100

| Metric | BF16 (Original) | W4A16 (AutoRound) | Δ |
|--------|-----------------|-------------------|---|
| Latency Mean | 60.21 s | 57.98 s | -3.7% |
| Latency Median | 60.21 s | 57.96 s | -3.7% |
| Latency P50 | 60.21 s | 57.96 s | -3.7% |
| Latency P95 | 60.38 s | 58.20 s | -3.6% |
| Latency P99 | 60.43 s | 58.26 s | -3.6% |
| Throughput | 0.0166 qps | 0.0172 qps | +3.8% |
| Peak Memory | 23294 MB (22.8 GiB) | 13680 MB (13.4 GiB) | -41.3% |
| Requests | 64 | 64 | n/a |
| Duration | 3853 s | 3711 s | -3.7% |
| Failed Requests | 0 | 0 | n/a |
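The Δ column can be reproduced from the raw values. A short sketch, assuming throughput is computed as requests divided by total duration (this matches the reported qps values, but the benchmark script itself is not shown here); `pct_change` is a helper defined for this example.

```python
def pct_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return 100.0 * (new - baseline) / baseline


requests = 64
bf16_duration_s, w4a16_duration_s = 3853.0, 3711.0

# Assumed definition: throughput = completed requests / wall-clock duration.
bf16_qps = requests / bf16_duration_s
w4a16_qps = requests / w4a16_duration_s

latency_delta = pct_change(60.21, 57.98)          # mean latency, seconds
throughput_delta = pct_change(bf16_qps, w4a16_qps)
memory_delta = pct_change(23294, 13680)           # peak memory, MB

print(f"latency {latency_delta:+.1f}%, "
      f"throughput {throughput_delta:+.1f}%, "
      f"peak memory {memory_delta:+.1f}%")
```

Computing the throughput delta from the unrounded qps values explains why the table shows +3.8% even though the rounded 0.0166 and 0.0172 figures alone would suggest about +3.6%.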


hsliuustc0106

BLOCKING:

Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:

| GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

hsliuustc0106 · 2026-04-23T09:46:03Z

Please add the latency test results as well.

lvliang-intel · 2026-04-23T14:04:42Z

> Please add the latency test results as well.

Will update the performance test result soon.

lvliang-intel · 2026-04-23T14:07:56Z

> BLOCKING:
>
> Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:
> | GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

Thanks for the reminder. Doc updated.

zhumingjue138 · 2026-04-24T06:56:48Z

Please add the necessary unit test cases.


lvliang-intel · 2026-04-25T10:43:09Z

> Please add the necessary unit test cases.

Unit tests added.


lishunyang12 · 2026-05-02T04:17:51Z

Can you try with a longer sequence?

lvliang-intel · 2026-05-04T12:41:27Z

> Can you try with a longer sequence?

Sure, I will run the performance test with a longer sequence.

david6666666 · 2026-05-18T06:13:16Z

Merge conflicts need fixing before review.


david6666666 · 2026-05-19T09:01:49Z

| Model | Scope | Status | Notes |
|-------|-------|--------|-------|
| BAGEL | Checkpoint-defined diffusion or transformer stage | Not validated | Requires a compatible AutoRound checkpoint |
| GLM-Image | Checkpoint-defined diffusion or transformer stage | Not validated | Requires a compatible AutoRound checkpoint |


david6666666 · 2026-05-19T09:23:41Z

@yenuo26 ptal thx

yenuo26 · 2026-05-19T09:28:38Z

+quantization configs for W4A16/AutoRound quantization support.
+"""
+
+from unittest.mock import MagicMock


It is recommended to use pytest-mock instead.

yenuo26 · 2026-05-19T09:29:44Z

+    if stage_config_path:
+        gen_kwargs["stage_configs_path"] = stage_config_path
+
+    with OmniRunner(model_name, seed=42, **gen_kwargs) as runner:


Maybe you can use the omni_runner fixture.

yenuo26 · 2026-05-19T09:30:21Z

+    first_output = outputs[0]
+    assert first_output.final_output_type == "image"
+    req_out = first_output.request_output
+    assert isinstance(req_out, OmniRequestOutput) and hasattr(req_out, "images")


This test is better suited to the nightly suite.
1. Please rename the script to xxxx_expansion.py.
2. Please change advanced_model to full_model.
3. Please add this test to test-nightly.yml.

yenuo26 · 2026-05-19T09:31:50Z

+``transformers.models.glm_image`` at module init, which may not be available
+in all environments.
+"""
+


Please add a pytest.mark.xxxxx marker.
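The docstring fragment above notes that `transformers.models.glm_image` may not be importable in every environment. One common way to handle that is to probe for the module before running tests that depend on it. The following is a minimal sketch using only the standard library; the helper `module_available` and the flag `HAS_GLM_IMAGE` are hypothetical names, not part of the PR.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located.

    Parent packages are imported during the lookup; the target module
    itself is not executed.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


HAS_GLM_IMAGE = module_available("transformers.models.glm_image")
print("transformers.models.glm_image available:", HAS_GLM_IMAGE)
```

In a pytest module, a flag like this is typically passed to `pytest.mark.skipif(not HAS_GLM_IMAGE, reason="...")`; which marker the reviewer has in mind is not stated in the comment.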

lvliang-intel requested a review from hsliuustc0106 as a code owner April 23, 2026 06:32

lvliang-intel changed the title ~~Feats/ar w4a16 glm image~~ [AutoRound] Support GLM-Image W4A16 quantization model Apr 23, 2026

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from f4723c4 to e194970 Compare April 23, 2026 07:14

lvliang-intel mentioned this pull request Apr 23, 2026

[vllm-omni]: Omni Quant Support intel/auto-round#1507

Open

hsliuustc0106 reviewed Apr 23, 2026


lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 5f0f515 to ef8ec3f Compare April 23, 2026 14:07

lishunyang12 mentioned this pull request Apr 24, 2026

[RFC]: Continuous Quantization Support #1854

Open

lvliang-intel added 6 commits April 25, 2026 18:41

support glm-image w4a16 with autoround

e214376

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

add e2e test

8eb13c5

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix lint

b8a8301

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix pre commit

cc6f7d9

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

update doc

2d8838b

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

add ut

d6af213

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from ef8ec3f to d6af213 Compare April 25, 2026 10:41

fix pre commit

91920fc

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lishunyang12 reviewed May 2, 2026


Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py

lishunyang12 reviewed May 2, 2026


Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py

lishunyang12 reviewed May 2, 2026


Comment thread tests/e2e/offline_inference/test_glm_image_autoround_w4a16.py

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 91920fc to 0bf6bbb Compare May 4, 2026 12:38

yiliu30 mentioned this pull request May 7, 2026

[RFC]: Intel Auto-Round x vLLM-Omni Quantization Support (2026 H1) #1325

Open

3 tasks

david6666666 mentioned this pull request May 8, 2026

[RFC] [0.22.0]: Quantization Support JiusiServe/vllm-omni#182

Open

8 tasks

Gaohan123 added this to the v0.22.0 milestone May 11, 2026

lvliang-intel mentioned this pull request May 13, 2026

[Feature]: Load/Evaluate W4A16 zai-org/GLM-Image on vllm-omni intel/auto-round#1510

Closed

2 tasks

Merge branch 'main' of https://github.com/lvliang-intel/vllm-omni int…

0ffcc2a

…o feats/ar-w4a16-glm-image Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 0bf6bbb to 0ffcc2a Compare May 19, 2026 02:14

lvliang-intel requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride, wtomin and yenuo26 as code owners May 19, 2026 02:14

david6666666 reviewed May 19, 2026


yenuo26 reviewed May 19, 2026



| GLM-Image | Checkpoint-defined diffusion or transformer stage | ✅ | AutoRound checkpoint name |


7 participants
